Clustering Co-occurrence Graph based on Transitivity
Abstract
Word co-occurrences form a graph in which words are nodes and co-occurrence relations are branches. A co-occurrence graph can therefore be constructed from the co-occurrence relations in a corpus. This paper discusses a method for clustering the co-occurrence graph, that is, decomposing the graph, from a graph-theoretical viewpoint. Since one application of the clustering results is ambiguity resolution, each output cluster is expected to have no ambiguity and to be specialized in a single topic. We observed that a graph has no ambiguity if its branches representing co-occurrence relations are transitive. An algorithm to extract such graphs is proposed, and the uniqueness of its output is discussed. The effectiveness of our method is examined by an experiment using a co-occurrence graph obtained from a 30M-byte corpus.
1 Introduction
Clustering is the operation of grouping words by some criterion. Thesauri and synonym dictionaries are manually built examples. Automatic outputs can be used not only to revise them, but also to aid ambiguity resolution, an essential problem in natural language processing. For instance, the meaning of an ambiguous word can be decided by examining the cluster it belongs to. Furthermore, clusters grouped according to topics have many application areas, such as automatic document classification.
The input in this paper is the word co-occurrence graph obtained from a corpus. The output is its subgraphs, under the condition that each subgraph is specialized in a topic.
Many automatic clustering methods have already been proposed. Most of them are based on the statistical similarity between two words. Our approach is different: it is graph-theoretical. We tried to find the special structure in linguistic graphs.
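The idea of a co-occurrence graph and the transitivity condition can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper. It assumes that two words co-occur when they share a sentence at least `min_count` times, and that "transitive" means every pair of branches a-b and b-c is accompanied by a branch a-c. The function names and the toy sentences are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, min_count=1):
    """Build an undirected graph: words are nodes, and a branch joins two
    words that share a sentence at least `min_count` times (assumed criterion)."""
    pair_counts = defaultdict(int)
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            pair_counts[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def intransitive_triples(graph):
    """Return triples (a, b, c) where a-b and b-c are branches but a-c is not."""
    triples = []
    for b, neighbours in graph.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in graph[a]:
                triples.append((a, b, c))
    return triples


def is_transitive(graph):
    """A graph is transitive (in the assumed sense) if no such triple exists."""
    return not intransitive_triples(graph)


if __name__ == "__main__":
    # Toy corpus (hypothetical): "bank" co-occurs with words from two topics.
    toy = ["bank money", "bank river"]
    g = build_cooccurrence_graph(toy)
    print(intransitive_triples(g))
```

In this toy graph the middle word of each reported triple links two words that never co-occur with each other, which is the kind of node one would suspect of being ambiguous under the assumption above.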
Given a huge co-occurrence graph obtained from a corpus, we first tried to decompose it and analyze its structure using graph-theoretical tools such as maximum strongly connected components or biconnected components. Although both tools decompose a graph into tightly connected subgraphs, these trials were in vain. The question arose: what must be taken into account to decompose the co-occurrence graph? The answer is ambiguity. Furthermore, we reached the conclusion that ambiguity can be explained in terms of intransitivity. This feature is developed into an algorithm for clustering.
This paper is organized as follows. The following chapter describes the relationship between transitivity in the graph and ambiguity resolution. Chapter 3 shows the relationships between clustering and transitivity. Chapter 4 proposes and discusses an algorithm for clustering. Related work is summarized in Chapter 5. Our method is examined in Chapter 6 by some experiments.
2 Word Ambiguity and Transitivity
Two words are said to co-occur when they frequently appear close to each other within texts. Regarding words as nodes and co-occurring re-
Similar resources
Graph-based Word Clustering using a Web Search Engine
Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By cal...
Single document Summarization based on Clustering Coefficient and Transitivity Analysis
Document summarization is a technique aimed to automatically extract the main ideas from electronic documents. With the fast increase of electronic documents available on the network, techniques for making efficient use of such documents become increasingly important. In this paper, we propose a novel algorithm, called TriangleSum for single document summarization based on graph theory. The alg...
Approximating Clustering Coefficient and Transitivity
Since its introduction in the year 1998 by Watts and Strogatz, the clustering coefficient has become a frequently used tool for analyzing graphs. In 2002 the transitivity was proposed by Newman, Watts and Strogatz as an alternative to the clustering coefficient. As many networks considered in complex systems are huge, the efficient computation of such network parameters is crucial. Several algo...
Transitivity and the Co-occurrence Relation in LSI
Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of Information Retrieval systems. Researchers use experimental methods to determine the appropriate number of dimensions for a given application. We propose the development of a theoretical foundation for determination of this parameter for LSI. We assert that LSI’s use of higher orders of co...
Clustering Protein Sequences - Structure Prediction by Transitive Homology
MOTIVATION It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a str...
Journal title:
Volume, issue:
Pages: -
Publication date: 1997